Abstract

This paper presents an original approach for jointly fitting survival 
times and classifying samples into subgroups. The Coxlogit model is a 
generalized linear model with a common set of selected features for both 
tasks. Survival times and class labels are here assumed to be conditioned 
by a common risk score which depends on those features. Learning is 
then naturally expressed as maximizing the joint probability of subgroup
labels and the ordering of survival events, conditioned on a common weight
vector. The model is estimated by minimizing a regularized negative log-likelihood
through a coordinate descent algorithm.

Validation on synthetic and breast cancer data shows that the proposed
approach outperforms a standard Cox model or logistic regression
when both predicting the survival times and classifying new samples into 
subgroups. It is also better at selecting informative features for both tasks. 


1 Introduction 

Survival analysis aims at modeling the relationships between several covariates 
(e.g. age, gender, environmental factors, gene expression values, ...) and the
time of specific events, such as relapse, metastasis or death. Cox proportional
hazard models are often used towards this objective [1]. A distinct objective is
to map the (patient) samples to different subgroups, for instance corresponding 
to specific tumor grades or subtypes. Given a collection of such samples labeled 
by clinicians, this second problem reduces to supervised learning of a classifier 
for which any standard algorithm (SVM, logistic regression, Naive Bayes, ...) 
could be used. 

The originality of this work is to tackle both problems jointly since the specific
subgroups of interest may exhibit distinct risk profiles which, in turn, could
condition their survival times [5]. We consider in particular generalized linear
models as they offer a direct interpretation in terms of individual feature relevances.
The proposed Coxlogit model is a natural extension to logistic regression
for which we assume that the survival times and class labels are random variables
conditioned by a common risk. We show that the partial likelihood of such a
model, to fit the ordering of observed survival times, is directly related to the
logistic class probabilities. Learning can then be expressed as maximizing the
joint probability of class labels and the ordering of survival events, conditioned
on a common weight vector. Embedded feature selection follows naturally when
fitting such a model with a LASSO or elastic net penalty. Such penalties prevent
overfitting while enforcing a common sparse support. Learning is also a convex
problem that can be efficiently solved through a coordinate descent algorithm.

We report practical experiments both on synthetic and real breast cancer 
datasets. Those experiments show that the Coxlogit approach outperforms either
a Cox model or logistic regression when both predicting the survival times
and classifying new samples into subgroups. The proposed approach is also 
better at selecting features that are informative for both tasks simultaneously. 


2 The Coxlogit approach 

One considers a survival analysis framework made of a collection of samples 
and their associated survival times, which are possibly censored. One further 
assumes that each training sample is labeled into a specific subgroup. Formally, 
each sample $i \in \{1,\dots,n\}$ is characterized by a 4-tuple $(t_i, \delta_i, y_i, x_i)$ where $t_i$
is the time of an event (such as metastasis or relapse) whenever $\delta_i = 1$, and
the censoring time whenever $\delta_i = 0$. Furthermore, $y_i$ denotes a binary class
label, respectively $-1$ and $1$ for two subgroups of interest. Patients of class $1$
are expected to have a higher risk than patients of class $-1$.

The survival data and the class label of patient $i$ are seen here as two observations
of random variables conditioned by a common risk of event, $r_i$. This risk
is simply modeled as a linear combination of the sample covariates ($x_i \in \mathbb{R}^p$):
$r_i = \beta^T x_i$, but the fit of the parameters $\beta$ should consider both types of
supervision.

Starting from the classification viewpoint, a logistic regression predicts from 
the vector $x_i$ the probability of patient $i$ to be in a specific group:

$$P(Y_i = 1 \mid x_i) = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \tag{1}$$

$$P(Y_i = -1 \mid x_i) = \frac{1}{1 + \exp(\beta^T x_i)} = 1 - P(Y_i = 1 \mid x_i) \tag{2}$$

The risk score of a patient, $r_i = \beta^T x_i$, can be interpreted through the logistic
model as class probabilities: high risk patients are more likely to be in the high
risk group $+1$ and a zero risk score corresponds to an equal probability to be
in either subgroup. The likelihood of the parameters $\beta$ with respect to the
observed labels $y_i$ is given by:
$$L(\beta) = \prod_{i=1}^{n} P(Y_i = y_i \mid x_i) = \prod_{i=1}^{n} \frac{1}{1 + \exp(-y_i \beta^T x_i)} \tag{3}$$
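The logistic part of the model can be illustrated with a minimal NumPy sketch. The function names and the toy data are illustrative assumptions, not taken from the paper; the code only evaluates the class probability (1) and the logarithm of the likelihood (3).

```python
import numpy as np

def prob_high_risk(beta, X):
    """P(Y_i = 1 | x_i) from equation (1), one value per row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def logistic_log_likelihood(beta, X, y):
    """Log of likelihood (3): -sum_i log(1 + exp(-y_i * beta^T x_i))."""
    margins = y * (X @ beta)
    # logaddexp(0, -m) equals log(1 + exp(-m)) and avoids overflow
    return -np.sum(np.logaddexp(0.0, -margins))

# Toy data (hypothetical): the first patient has a zero risk score
X = np.array([[0.0, 0.0], [1.0, -1.0]])
y = np.array([1, -1])
beta = np.array([2.0, 0.0])
p = prob_high_risk(beta, X)
ll = logistic_log_likelihood(beta, X, y)
```

A zero risk score gives a probability of 0.5 for either subgroup, as stated in the text.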


Looking now at the survival times and knowing that an event occurs at $t_i$,
one typically computes the probability of patient $i$ having the event over the
set of patients still at risk just before time $t_i$: $R(t_i) = \{j \mid t_j \ge t_i\}$. Since high
risk patients tend to have the event before low risk patients, the likelihood of
observed events can also be seen as the conditional probability of patient $i$ being
in the high risk group and all others in the low risk group (knowing that exactly
one patient has the event before the others at that time). This likelihood can
also be expressed in terms of the logistic class probabilities $P(Y_i = 1 \mid x_i)$ and
$P(Y_i = -1 \mid x_i)$:


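$$
\begin{aligned}
L_i(\beta) &= \frac{P(Y_i = 1 \mid x_i) \prod_{j \in R(t_i) \setminus \{i\}} P(Y_j = -1 \mid x_j)}{\sum_{k \in R(t_i)} P(Y_k = 1 \mid x_k) \prod_{j \in R(t_i) \setminus \{k\}} P(Y_j = -1 \mid x_j)} \\
&= \frac{\frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \prod_{j \in R(t_i) \setminus \{i\}} \frac{1}{1 + \exp(\beta^T x_j)}}{\sum_{k \in R(t_i)} \frac{\exp(\beta^T x_k)}{1 + \exp(\beta^T x_k)} \prod_{j \in R(t_i) \setminus \{k\}} \frac{1}{1 + \exp(\beta^T x_j)}} \\
&= \frac{\exp(\beta^T x_i)}{\sum_{k \in R(t_i)} \exp(\beta^T x_k)}
\end{aligned} \tag{4}
$$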
Expression (4), aggregated over all survival times, boils down to the partial
likelihood of a Cox model for survival data, except that censoring should also be
considered. Formally, the computation is restricted to those patients for which
the event is observed ($\delta_i = 1$).
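The equality between the first and last lines of (4) can be checked numerically. The sketch below is only an illustration under assumed toy data; all function names are hypothetical. It evaluates the ratio of products of logistic class probabilities and compares it with the Cox form, then aggregates the Cox form over observed events.

```python
import numpy as np

def risk_set(times, i):
    """Indices j with t_j >= t_i: patients still at risk just before t_i."""
    return np.flatnonzero(times >= times[i])

def event_prob_logistic_form(beta, X, times, i):
    """First line of (4), built from logistic class probabilities."""
    R = risk_set(times, i)
    p_pos = 1.0 / (1.0 + np.exp(-(X[R] @ beta)))   # P(Y_j = 1 | x_j)
    p_neg = 1.0 - p_pos                            # P(Y_j = -1 | x_j)
    # Term for k: patient k in the high risk group, all others in the low one
    terms = np.array([p_pos[k] * np.prod(np.delete(p_neg, k))
                      for k in range(len(R))])
    pos_i = int(np.flatnonzero(R == i)[0])
    return terms[pos_i] / terms.sum()

def event_prob_cox_form(beta, X, times, i):
    """Last line of (4): exp(beta^T x_i) over the sum on the risk set."""
    R = risk_set(times, i)
    scores = X[R] @ beta
    w = np.exp(scores - scores.max())              # shift for stability
    pos_i = int(np.flatnonzero(R == i)[0])
    return w[pos_i] / w.sum()

def cox_partial_log_likelihood(beta, X, times, events):
    """Sum of log L_i(beta) over patients with an observed event."""
    return sum(np.log(event_prob_cox_form(beta, X, times, i))
               for i in np.flatnonzero(events == 1))

# Toy data (hypothetical): 5 patients, 2 covariates, one censored patient
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))
beta = np.array([0.7, -0.3])
times = np.array([2.0, 5.0, 1.0, 4.0, 3.0])
events = np.array([1, 0, 1, 1, 1])
```

The normalizing products cancel between numerator and denominator, which is why both forms agree for every patient.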

The likelihood of the Coxlogit model is now defined as the joint probability of
the observed events and subgroup labels knowing the parameters $\beta$. Assuming
the labels and the times to event to be conditionally independent given those
parameters, this likelihood can be computed as:


$$L(\beta) = \prod_{i=1}^{n} \frac{1}{1 + \exp(-y_i \beta^T x_i)} \left[ \frac{\exp(\beta^T x_i)}{\sum_{j \in R(t_i)} \exp(\beta^T x_j)} \right]^{\delta_i} \tag{7}$$


The log-likelihood $l(\beta)$ of the Coxlogit model is thus naturally formulated
as the sum of a Cox and a logistic regression log-likelihood:

$$-l(\beta) = \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \beta^T x_i)\right) - \sum_{i=1}^{n} \delta_i \beta^T x_i + \sum_{i=1}^{n} \delta_i \log\Big(\sum_{j \in R(t_i)} \exp(\beta^T x_j)\Big) \tag{8}$$

(Footnote: the Cox likelihood is called partial as it only relies on the ordering of the
events and not on the actual times when those events occur.)
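The negative log-likelihood (8) can be sketched in a few lines of NumPy. This is an illustrative implementation under the assumption of a dense n-by-n risk-set mask, which is convenient for small n; the function name is hypothetical.

```python
import numpy as np

def coxlogit_neg_log_likelihood(beta, X, y, times, events):
    """Evaluate -l(beta) from (8): logistic term plus Cox partial term."""
    scores = X @ beta
    # Logistic part: sum_i log(1 + exp(-y_i * beta^T x_i))
    logistic_term = np.sum(np.logaddexp(0.0, -y * scores))
    # at_risk[i, j] is True when patient j belongs to R(t_i)
    at_risk = times[None, :] >= times[:, None]
    masked = np.where(at_risk, scores[None, :], -np.inf)
    # Row-wise log-sum-exp over each risk set (every row contains i itself)
    row_max = masked.max(axis=1)
    log_risk_sums = row_max + np.log(np.exp(masked - row_max[:, None]).sum(axis=1))
    # Cox part: sum_i delta_i * (log-sum-exp over R(t_i) - beta^T x_i)
    cox_term = np.sum(events * (log_risk_sums - scores))
    return logistic_term + cox_term

# Toy check (hypothetical data): with beta = 0 every score is zero
X = np.ones((3, 2))
y = np.array([1, -1, 1])
times = np.array([1.0, 2.0, 3.0])
events = np.array([1, 0, 1])
value_at_zero = coxlogit_neg_log_likelihood(np.zeros(2), X, y, times, events)
```

At $\beta = 0$ the logistic term equals $n \log 2$ and each observed event contributes the logarithm of the size of its risk set, which gives a simple sanity check.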


An embedded feature selection follows by regularizing this objective:

$$\operatorname*{argmin}_{\beta} \; -\frac{1}{n}\, l(\beta) + \lambda\, \Omega(\beta) \tag{9}$$

where $\Omega(\beta)$ is a sparsity enforcing regularization such as LASSO or elastic
net [6], and $\lambda > 0$ a regularization constant. A coordinate descent algorithm,
adapted from [3], is used here to solve this convex problem. It starts from a
trivial solution ($\beta = 0$) for a large $\lambda$, and follows the regularization path as
$\lambda$ is gradually decreased until the model includes a desired number of features
(i.e. non-zero weight values).
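The regularization path idea can be illustrated with a short sketch. Note that this is not the coordinate descent algorithm of [3] used in the paper: as a simpler stand-in, it uses proximal gradient steps (soft-thresholding) for a LASSO penalty, with warm starts along a decreasing grid of $\lambda$ values. The step size, grid, iteration counts, function names and toy data are all assumptions, and the loop stops as soon as at least the requested number of features is selected.

```python
import numpy as np

def objective_grad(beta, X, y, times, events):
    """Gradient of -l(beta)/n, with l as in (8)."""
    n = X.shape[0]
    s = X @ beta
    # Logistic part: derivative of log(1 + exp(-y_i s_i)) w.r.t. s_i
    g = -y / (1.0 + np.exp(y * s))
    # at_risk[i, j] is True when patient j belongs to R(t_i)
    at_risk = times[None, :] >= times[:, None]
    w = np.where(at_risk, np.exp(s - s.max())[None, :], 0.0)
    w = w / w.sum(axis=1, keepdims=True)      # softmax over each risk set
    # Cox part: sum_i delta_i * w_ij - delta_j for each score s_j
    g = g + events @ w - events
    return X.T @ g / n

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_path(X, y, times, events, n_features,
               n_lambdas=50, step=0.1, n_iter=200):
    """Decrease lambda from lam_max until enough features are non-zero."""
    beta = np.zeros(X.shape[1])
    # Smallest lambda for which beta = 0 is a solution
    lam_max = np.max(np.abs(objective_grad(beta, X, y, times, events)))
    for lam in lam_max * np.logspace(0, -2, n_lambdas):
        for _ in range(n_iter):               # warm start from previous lambda
            grad = objective_grad(beta, X, y, times, events)
            beta = soft_threshold(beta - step * grad, step * lam)
        if np.count_nonzero(beta) >= n_features:
            break
    return beta, lam

# Toy data (hypothetical): the first 3 of 8 features drive both tasks
rng = np.random.default_rng(0)
n, p = 60, 8
X = rng.standard_normal((n, p))
eta = X[:, :3].sum(axis=1)
y = np.where(eta >= 0, 1, -1)
times = rng.exponential(scale=np.exp(-eta))   # higher risk, earlier event
events = (rng.uniform(size=n) < 0.8).astype(float)
beta_hat, lam_hat = lasso_path(X, y, times, events, n_features=3)
```

A fixed step size is only safe when it is small relative to the curvature of the objective; a line search or the coordinate-wise updates of [3] would be more robust choices.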

3 Experiments 

We first consider an artificial dataset to assess to which extent the Coxlogit
approach is able to select informative features both for classification and survival
prediction. A data matrix $X \in \mathbb{R}^{n \times p}$ is drawn from a $\mathcal{N}(0,1)$ distribution to
represent covariates that have been centered and normalized to unit variance.
Those features are partitioned into 4 groups. Each of the first 3 groups includes
$k$ features, which are predictive of both the survival and the group label (first
group), of the survival only (second group) or of the group label only (third group).
The $p - 3k$ remaining features are purely random and represent noise.

The hazard and group assignments are generated from distinct linear combinations
of the informative features, with weights drawn from a uniform distribution
over $[-1,-0.5] \cup [0.5,1]$. The survival data $(t_i, \delta_i)$ are generated
from two Weibull distributions. The distribution for the time to event $t_i$ is
parametrized such that the hazards $h_i(t)$ only depend on the features from the
first and second groups: $h_i(t) \propto \exp(X_{i,1:2k}\, \beta_{1:2k})$. Similarly, the group labels
$y \in \{-1,1\}^n$ only depend on the features from the first and third groups:

$y = \mathrm{sign}(X\beta)$, where the weights of this second linear combination are zero
outside the first and third groups.
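The generation process described above can be sketched as follows. Several details are assumptions made only for illustration, since the text does not specify them: the Weibull shape, the scale of the censoring distribution, the use of independent weights for the two linear combinations, and all variable names.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 1000, 100, 10

# Covariates: centered, unit variance
X = rng.standard_normal((n, p))

def informative_weights(size):
    """Weights drawn uniformly over [-1, -0.5] U [0.5, 1]."""
    return rng.choice([-1.0, 1.0], size=size) * rng.uniform(0.5, 1.0, size=size)

# Hazards depend on groups 1 and 2 (features 0 .. 2k-1)
beta_surv = np.zeros(p)
beta_surv[:2 * k] = informative_weights(2 * k)

# Labels depend on groups 1 and 3 (features 0 .. k-1 and 2k .. 3k-1)
beta_class = np.zeros(p)
beta_class[np.r_[0:k, 2 * k:3 * k]] = informative_weights(2 * k)
y = np.where(X @ beta_class >= 0, 1, -1)

# Weibull proportional hazards: h_i(t) = shape * t**(shape - 1) * exp(x_i^T beta),
# sampled by inverse transform: T = (-log U / exp(x_i^T beta))**(1 / shape)
shape = 1.5          # assumed value
censor_scale = 2.0   # assumed value
u = 1.0 - rng.uniform(size=n)                  # in (0, 1], avoids log(0)
event_time = (-np.log(u) / np.exp(X @ beta_surv)) ** (1.0 / shape)
censor_time = censor_scale * rng.weibull(shape, size=n)

# Observed time and event indicator
t = np.minimum(event_time, censor_time)
delta = (event_time <= censor_time).astype(int)
```

Splitting the 1000 samples into 200 training and 800 validation samples, as in the experiments reported next, is then a matter of slicing the arrays.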


Results are averaged in the table below over 100 independent runs with 
n = 1000 samples (200 for training, 800 for validation) with p = 100 features 
among which the first k = 10 features are jointly predictive of survival and 
classification. The Coxlogit model is compared to a Cox proportional hazard 
model and logistic regression while following in all cases the regularization path 
till selecting exactly 10 features. The first column illustrates that the Coxlogit 
approach selects more features from the first group, which are those informative 
for group classification and survival prediction. The remaining columns report 
the validation results in terms of classification accuracy, Concordance index
(measuring the correct ordering of survival times according to risk groups [4])
and the harmonic mean between those metrics. They illustrate that Coxlogit
achieves the best combined performance when tackling both tasks, while each
single-task model remains slightly better on its own task only.

(Footnote: in survival analysis, the hazard is a time dependent function corresponding to the
probability of a patient, still at risk, to experience the event at time t.)
Methods     Features   Accuracy   C-index   Predictive Performance
Coxlogit    6.59/10    0.67       0.80      0.73
Cox         4.67/10    0.59       0.81      0.68
Logistic    4.65/10    0.71       0.65      0.68
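The two survival and aggregate metrics of the table can be sketched as below. The pairwise definition of the Concordance index and its handling of ties (half credit) are assumptions of this illustration, as are the function names; the harmonic mean is the aggregate reported in the last column.

```python
import numpy as np

def concordance_index(times, events, risk):
    """Fraction of comparable pairs whose risk scores agree with event order.

    A pair (i, j) is comparable when patient i has an observed event
    strictly before the (event or censoring) time of patient j.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5      # ties in risk get half credit
    return concordant / comparable

def harmonic_mean(a, b):
    """Harmonic mean of two positive scores."""
    return 2.0 * a * b / (a + b)

# Toy example (hypothetical): risk scores perfectly ordered with event times
times = np.array([1.0, 2.0, 3.0])
events = np.array([1, 1, 0])
perfect = concordance_index(times, events, np.array([3.0, 2.0, 1.0]))
reversed_order = concordance_index(times, events, np.array([1.0, 2.0, 3.0]))

# Harmonic means of the rounded accuracy and C-index values of the table
table_scores = [round(harmonic_mean(acc, c), 2)
                for acc, c in [(0.67, 0.80), (0.59, 0.81), (0.71, 0.65)]]
```

The harmonic mean penalizes a model that is strong on one task and weak on the other, which is why the two single-task baselines end up with a lower aggregate score despite each being best on its own metric.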


We further assess the Coxlogit approach on 4 breast cancer studies (GSE2034,
GSE5327, GSE7390, GSE2990) from the GEO repository. Those samples are
gene expression values measured on the Affymetrix HGU133a microarray platform
and distant metastasis is used as survival end point. All samples are
gathered in a common dataset including 554 patients and 1236 features, after
keeping only the dimensions with the largest variances. The objective is to predict
both the grade of the tumor [2], discretized into low versus high grade with
roughly equal priors, and the survival probability of the patients.

Results are reported below over 100 resamplings (without replacement) into
90% training/10% test, for various feature set sizes. The predictive performance
(averaged over the 100 runs) is the harmonic mean between test classification
accuracy and Concordance index. Such an aggregate metric is representative of
the performances on both tasks of grade classification and survival prediction.
Those results illustrate the overall benefit of the Coxlogit model as compared
to the original Cox model or logistic regression.



[Figure: predictive performance as a function of the number of selected features (# Features)]

4 Conclusion and perspectives 

This paper describes the Coxlogit method, which is a generalized linear model
to predict survival times and jointly classify samples into subgroups. Once
regularized with a sparsity inducing term, it offers an embedded feature selection
to discover informative features for both tasks. We consider here classification
into 2 specific groups, but a generalization to multi-class or continuous responses
appears straightforward. It would essentially amount to replacing the logistic
part of the objective by its multinomial extension using a softmax function or
by a square loss. The specific subgroups of interest are here supposed to be known
a priori at training time, and typically provided by clinical annotations in a real
scenario. Such an assumption could also be relaxed by considering unsupervised
or semi-supervised learning of those groups.